Notes: Most of the points on the scatter plot fall in one big cluster at the bottom, where friend counts are low.
Notes:
library(ggplot2)
pf <- read.delim("pseudo_facebook.tsv")
qplot(age, friend_count, data = pf)
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point()
Response: It looks like younger users have a lot of friends. There are some vertical bars where people have likely lied about their age, such as 69 and around 100. Those users are probably teenagers or perhaps fake accounts, given their very high friend counts.
Notes:
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point() +
xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
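The warning above comes from `xlim()`, which drops every row whose x value falls outside the limits before the points are drawn. A minimal sketch of the same count in base R, using a small made-up `ages` vector rather than the course data:

```r
# Hypothetical ages, made up for illustration; not the pseudo_facebook data.
ages <- c(13, 25, 47, 90, 95, 103, 113)

# xlim(13, 90) keeps values inside [13, 90] and discards the rest.
outside <- ages < 13 | ages > 90

# Number of rows that would be reported as removed.
n_removed <- sum(outside)
n_removed
```

On the real data, `sum(pf$age < 13 | pf$age > 90)` should give a comparable count for the un-jittered plot. The jittered plots report slightly more removed rows, presumably because jitter pushes some points near the limits outside the range.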
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha = 0.05) +
xlim(13, 90)
## Warning: Removed 5188 rows containing missing values (geom_point).
Response: With this new plot, we can see that the friend counts for young users aren’t nearly as high as they looked before.
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 0.05) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 0.05, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 5182 rows containing missing values (geom_point).
We can see the thresholds of friend count above which there are very few users.
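Setting `h = 0` in `position_jitter()` restricts the jitter to the x direction. One reason to do this with a square-root axis: vertical jitter could push a friend count of 0 below zero, and the square root of a negative number is not a real number. A small base R sketch of the problem, using made-up counts and fixed offsets in place of random jitter:

```r
# Made-up friend counts that include zeros (illustrative only).
friend_count <- c(0, 0, 3, 10)

# Stand-in for vertical jitter: fixed offsets, some of them negative.
offsets <- c(-0.4, 0.2, -0.1, 0.3)
jittered <- friend_count + offsets

# sqrt() of a negative value yields NaN (R also emits a warning,
# suppressed here to keep the output clean).
transformed <- suppressWarnings(sqrt(jittered))
is.nan(transformed)
```

Only the first value, which was pushed below zero, becomes `NaN`; with `h = 0` the y values are left untouched and this cannot happen.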
Notes:
# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.
# We recommend creating a basic scatter
# plot first to see what the distribution looks like,
# and then adjusting it by adding one layer at a time.
# What are your observations about your final plot?
# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.
# ENTER ALL OF YOUR CODE FOR YOUR PLOT BELOW THIS
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha = 0.05, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 5181 rows containing missing values (geom_point).
Notes: Percentage transform
Notes:
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## # A tibble: 6 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 165. 74 484
## 2 14 251. 132 1925
## 3 15 348. 161 2618
## 4 16 352. 172. 3086
## 5 17 350. 156 3283
## 6 18 331. 162 5196
pf.fc_by_age <- pf %>%
group_by(age) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
head(pf.fc_by_age)
## # A tibble: 6 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 165. 74 484
## 2 14 251. 132 1925
## 3 15 348. 161 2618
## 4 16 352. 172. 3086
## 5 17 350. 156 3283
## 6 18 331. 162 5196
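Both versions compute the same conditional summaries: split the rows by `age`, then summarise `friend_count` within each group. As a rough cross-check that does not need dplyr, the same idea can be sketched in base R on a tiny made-up data frame (`toy` and `by_age` are hypothetical names, not part of the course code):

```r
# Tiny made-up data frame standing in for pf (illustrative only).
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 15, 100)
)

# Split friend_count by age and compute each summary per group.
# tapply() returns the groups in increasing order of age.
by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)
by_age
```

Note how the single large value (100) pulls the mean for age 14 well above the median, which is the same effect visible in the real summary table.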
Create your plot!
# Plot mean friend count vs. age using a line graph.
# Be sure you use the correct variable names
# and the correct data frame. You should be working
# with the new data frame created from the dplyr
# functions. The data frame is called 'pf.fc_by_age'.
# Use geom_line() rather than geom_point to create
# the plot. You can look up the documentation for
# geom_line() to see what it does.
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
# There is an odd spike at age 69.
# Young users still have high mean friend counts,
# and for ages between 30 and 60 the mean hovers just above 100.
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
xlim(13, 90) +
geom_point(alpha = 0.05,
position = position_jitter(h=0),
color = "orange") +
coord_trans(y = "sqrt") +
geom_line(stat = "summary", fun.y = mean) +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .1), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .5), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .9), linetype = 2, color = "blue")
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5185 rows containing missing values (geom_point).
ggplot(aes(x = age, y = friend_count), data = pf) +
coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
geom_point(alpha = 0.05,
position = position_jitter(h=0),
color = "orange") +
geom_line(stat = "summary", fun.y = mean) +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .1), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .5), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile,
fun.args = list(probs = .9), linetype = 2, color = "blue")
Response: For users aged 35 to 60, the 90th-percentile line stays below 250, so 90% of users in this age range have fewer than 250 friends.
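Each dashed line in the plots above is a conditional quantile: for every age, the value below which a given fraction of friend counts fall. A minimal base R sketch on made-up numbers (the `toy` data frame is hypothetical and far smaller than `pf`):

```r
# Made-up data: 11 users at age 30 and 11 users at age 40.
toy <- data.frame(
  age          = rep(c(30, 40), each = 11),
  friend_count = c((0:10) * 10, (0:10) * 20)
)

# 90th percentile of friend_count within each age group.
# Extra arguments to tapply() are passed on to quantile().
q90_by_age <- tapply(toy$friend_count, toy$age, quantile, probs = 0.9)
q90_by_age
```

Reading the result the same way as the plot: 90% of the made-up users at a given age have a friend count at or below the value shown for that age.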
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: pass
Notes:
cor.test(pf$age, pf$friend_count, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
with(pf, cor.test(age, friend_count, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027 (no meaningful linear relationship)
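The `cor` estimate reported by `cor.test()` is Pearson's r: the covariance of the two variables divided by the product of their standard deviations. A small sketch that checks this by hand on made-up vectors (`x` and `y` are illustrative, not columns of `pf`):

```r
# Made-up paired observations (illustrative only).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r from its definition.
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantity from the built-in functions.
r_builtin <- cor(x, y, method = "pearson")
r_test    <- unname(cor.test(x, y, method = "pearson")$estimate)

c(manual = r_manual, builtin = r_builtin, test = r_test)
```

All three values agree, which is a quick way to confirm what the number in the `sample estimates` section of the output means.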
Notes:
with(subset(pf, age < 70), cor.test(age, friend_count,
method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.326, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1775257 -0.1648889
## sample estimates:
## cor
## -0.1712144
Notes: Correlation Methods: Pearson’s r, Spearman’s rho, and Kendall’s tau
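Of these three, Pearson's r measures linear association, while Spearman's rho and Kendall's tau are rank-based and respond to any monotonic relationship. A hedged sketch on made-up data showing the difference, and that Spearman's rho is Pearson's r computed on ranks:

```r
# Made-up data: y increases with x, but not linearly.
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Spearman's rho is Pearson's r applied to the ranks of each variable.
rank_based <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

Because the relationship is perfectly monotonic, both rank-based measures equal 1, while Pearson's r is below 1 since the points do not fall on a straight line.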
Notes:
# Create a scatterplot of likes_received (y)
# vs. www_likes_received (x). Use any of the
# techniques that you've learned so far to
# modify the plot.
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point()
Notes:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point() +
coord_cartesian(xlim = c(0, quantile(pf$www_likes_received, 0.95)),
ylim = c(0, quantile(pf$likes_received, 0.95))) +
geom_smooth(method = "lm", color = "red")
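In the plot above, `coord_cartesian()` only changes the visible window, so `geom_smooth()` is still fit to every row; `quantile(..., 0.95)` supplies the cutoff that hides the top 5% of values. A small sketch of how that cutoff behaves, on a made-up vector with one extreme value:

```r
# Made-up like counts with a single very large value (illustrative only).
www_likes <- c(0, 0, 1, 2, 3, 5, 8, 13, 21, 500)

# The 95th percentile, used as the upper axis limit.
cutoff <- quantile(www_likes, 0.95)

# Share of observations that remain visible at or below the cutoff.
share_in_view <- mean(www_likes <= cutoff)

cutoff
share_in_view
```

With so few points the cutoff is interpolated between the two largest values, so only the extreme observation falls outside the window.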
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(pf$www_likes_received, pf$likes_received)
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948. The correlation is this strong because one of the variables is really a superset of the other.
Notes: Typically, when I’m working on a problem, I’m going to be doing some kind of regression where I’m modeling something, and I’m going to be throwing some of these variables into the regression. One of the assumptions of regression is that these variables are independent of each other.
If any two are too highly correlated with each other, it will be really difficult to tell which ones are actually driving the phenomenon.
So it’s important to measure the correlation between your variables first, because it will often help you determine which ones you don’t want to throw in together and which ones you want to keep.
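One way to act on this advice is to inspect the pairwise correlation matrix of the candidate predictors before fitting anything. A minimal sketch on made-up data; the column names and the 0.9 threshold are assumptions chosen for illustration, not values from the lesson:

```r
# Made-up predictors: b is nearly a multiple of a, c is unrelated.
predictors <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, 13),
  c = c(5, 1, 4, 2, 6, 3)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(predictors)

# Flag each pair once (upper triangle) whose |r| exceeds the assumed threshold.
high <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)

round(cor_matrix, 3)
high
```

Here only the pair (`a`, `b`) is flagged, which suggests keeping one of the two rather than putting both into the same model.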
Notes:
#install.packages('alr3')
library(alr3)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
data("Mitchell")
?Mitchell
Create your plot!
# Create a scatterplot of temperature (Temp)
# vs. months (Month).
ggplot(data = Mitchell, aes(x = Month, y = Temp)) +
geom_point()
qplot(data = Mitchell, Month, Temp)
cor.test(Mitchell$Month, Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes:
ggplot(data = Mitchell, aes(x = Month, y = Temp)) +
geom_point() +
scale_x_continuous(breaks = seq(0, 203, 12))
What do you notice? Response: a cyclical pattern, like a sine or cosine graph
Watch the solution video and check out the Instructor Notes!
Notes:
ggplot(data = Mitchell, aes(x = (Month%%12), y = Temp)) +
geom_point()
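The near-zero correlation and the cyclical plot are consistent with each other, because Pearson's r only measures linear association. A sketch using a synthetic sine wave in place of the Mitchell temperatures (the `temp` values below are made up), which also shows how `%%` folds every month onto a single 12-month cycle:

```r
# 204 consecutive months, matching the length of the Mitchell data.
month <- 0:203

# Synthetic temperature: a pure seasonal cycle with period 12 (made up).
temp <- sin(2 * pi * month / 12)

# Linear correlation between time and the cyclical series.
r_linear <- cor(month, temp)

# Month %% 12 maps each month onto its position within the year (0 to 11).
month_of_year <- month %% 12

r_linear
range(month_of_year)
```

The series is perfectly predictable from the month of the year, yet its linear correlation with time is close to zero.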
Notes:
# Create a new variable, 'age_with_months', in the 'pf' data frame.
# Be sure to save the variable in the data frame rather than creating
# a separate, stand-alone variable. You will need to use the variables
# 'age' and 'dob_month' to create the variable 'age_with_months'.
# Assume the reference date for calculating age is December 31, 2013.
pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12
pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
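The three assignments above are equivalent ways of writing the same formula. As a worked example under the stated reference date of December 31, consider a hypothetical user who is 36 and was born in March: 9 of the 12 months since their last birthday have passed, so their age in years is 36.75.

```r
# Hypothetical user (illustrative values, not from the data).
age       <- 36
dob_month <- 3

# Fraction of a year elapsed since the last birthday, two equivalent forms.
v1 <- age + (12 - dob_month) / 12
v2 <- age + (1 - dob_month / 12)

c(v1, v2)
```

A December birthday gives a fraction of 0, so `age_with_months` equals `age` for those users.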
# Create a new data frame called
# pf.fc_by_age_months that contains
# the mean friend count, the median friend
# count, and the number of users in each
# group of age_with_months. The rows of the
# data frame should be arranged in increasing
# order by the age_with_months variable.
# For example, the first two rows of the resulting
# data frame would look something like...
# age_with_months friend_count_mean friend_count_median n
# 13 275.0000 275 2
# 13.25000 133.2000 101 11
# See the Instructor Notes for two hints if you get stuck.
# This programming assignment will automatically be graded.
pf <- read.delim('pseudo_facebook.tsv')
pf$age_with_months <-pf$age + (1 - pf$dob_month / 12)
suppressMessages(library(dplyr))
Programming Assignment
pf.fc_by_age_months <- pf %>%
group_by(age_with_months) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age_with_months)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.2 46.3 30.5 6
## 2 13.2 115. 23.5 14
## 3 13.3 136. 44 25
## 4 13.4 164. 72 33
## 5 13.5 131. 66 45
## 6 13.6 157. 64 54
age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
head(pf.fc_by_age_months2)
## # A tibble: 6 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.2 46.3 30.5 6
## 2 13.2 115. 23.5 14
## 3 13.3 136. 44 25
## 4 13.4 164. 72 33
## 5 13.5 131. 66 45
## 6 13.6 157. 64 54
# Create a new line plot showing friend_count_mean versus the new variable,
# age_with_months. Be sure to use the correct data frame (the one you created
# in the last exercise) AND subset the data to investigate users with ages less
# than 71.
ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71),
aes(x = age_with_months, y = friend_count_mean)) +
geom_line()
Notes:
p1 <- ggplot(data = subset(pf.fc_by_age, age < 71),
aes(x = age, y = friend_count_mean)) +
geom_line() +
geom_smooth()
p2 <- ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71),
aes(x = age_with_months, y = friend_count_mean)) +
geom_line() +
geom_smooth()
p3 <- ggplot(data = subset(pf.fc_by_age, age < 71),
aes(x = round(age / 5) * 5, y = friend_count_mean)) +
geom_line(stat = "summary", fun.y = mean)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p2, p1, p3, ncol = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
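The third plot, `p3`, bins the ages to the nearest multiple of 5 with `round(age / 5) * 5` and then averages the per-age means that fall into each bin, trading resolution for a smoother line. A small sketch of the binning step on made-up ages:

```r
# Made-up ages (illustrative only).
ages <- c(13, 17, 18, 22, 24, 28)

# Divide by the bin width, round to the nearest integer, scale back up.
bins <- round(ages / 5) * 5

data.frame(age = ages, bin = bins)
```

Ages 13 and 17 share the bin labeled 15, and 18 and 22 share the bin labeled 20, so each point on the line in `p3` summarises roughly five neighboring ages.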
Notes: One important answer is that you don’t have to choose. In exploratory data analysis, we’ll often create multiple visualizations and summaries of the same data, gleaning different insights from each.
Reflection: We covered scatter plots, conditional means, and correlation coefficients, and we learned how to explore the relationship between two variables.